Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
library(tibble)
library(readr)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(tidyr)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
human_new <- read_csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/human2.csv")
## Rows: 155 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (8): Edu2.FM, Labo.FM, Life.Exp, Edu.Exp, GNI, Mat.Mor, Ado.Birth, Parli.F
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# no NA
any(is.na(human_new))
## [1] FALSE
# analyze and summary of variables
str(human_new)
## spc_tbl_ [155 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:155] "Norway" "Australia" "Switzerland" "Denmark" ...
## $ Edu2.FM : num [1:155] 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num [1:155] 0.891 0.819 0.825 0.884 0.829 ...
## $ Life.Exp : num [1:155] 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ Edu.Exp : num [1:155] 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ GNI : num [1:155] 64992 42261 56431 44025 45435 ...
## $ Mat.Mor : num [1:155] 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num [1:155] 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num [1:155] 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. Edu2.FM = col_double(),
## .. Labo.FM = col_double(),
## .. Life.Exp = col_double(),
## .. Edu.Exp = col_double(),
## .. GNI = col_double(),
## .. Mat.Mor = col_double(),
## .. Ado.Birth = col_double(),
## .. Parli.F = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(human_new)
## Country Edu2.FM Labo.FM Life.Exp
## Length:155 Min. :0.1717 Min. :0.1857 Min. :49.00
## Class :character 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30
## Mode :character Median :0.9375 Median :0.7535 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :83.50
## Edu.Exp GNI Mat.Mor Ado.Birth
## Min. : 5.40 Min. : 581 Min. : 1.0 Min. : 0.60
## 1st Qu.:11.25 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65
## Median :13.50 Median : 12040 Median : 49.0 Median : 33.60
## Mean :13.18 Mean : 17628 Mean : 149.1 Mean : 47.16
## 3rd Qu.:15.20 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95
## Max. :20.20 Max. :123124 Max. :1100.0 Max. :204.80
## Parli.F
## Min. : 0.00
## 1st Qu.:12.40
## Median :19.30
## Mean :20.91
## 3rd Qu.:27.95
## Max. :57.50
# Country is as rownames
human_new <- column_to_rownames(human_new, "Country")
# histogram per variable
pivot_longer(human_new, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# charts and correlations
ggpairs(human_new,lower=list(combo=wrap("facethist",binwidth=0.1)))
# correlation between variables
c <- cor(human_new)
corrplot(c, method="circle")
The distribution of these variables appears to deviate from a normal distribution.
HDI, Life expectancy, education exp, education mean is negatively correlated with GII, Mat mortality, adolescent birth. GII and maternity mortality are also negatively correlated with education for female and male.
HDI, Life expectancy, education exp, education mean is positively correlated with education of male and female. GII, adolescent birth and maternity mortality are also positively correlated.
> This observed correlations appears coherent, highlighting logical relationships between the various variables.
Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)
# PCA
pca_human <- prcomp(human_new)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# GNI has too big variances and is dominating the pca. The magnitude is too different between the metrics
Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)
# standardization of human_new
scaled_human <- scale(human_new)
# new PCA with scaled data
pca_human2 <- prcomp(scaled_human)
# draw a biplot of the PCA, adding captions that describe the phenomena
biplot(pca_human2, choices = 1:2,
xlab="PC1 - Country Socio-Economic Development", ylab="PC2 - Female Empowerment and Socio-Economic Development by Women", main = " Principal Component Analysis of Socio-Economic Indicators: Mapping Country Development Factors and Female Empowerment")
# Check for missing values
any(is.na(scaled_human))
## [1] FALSE
# Check for infinite values
any(is.infinite(scaled_human))
## [1] FALSE
PC1 - Country Socio-Economic Development. This component shows multiple socio-economic factors contributing to overall country development. It includes on one side, variables like Gross National Income, Human Development Index, Life Expectancy, Education, and on the other side, variables like Adolescent birth rate, Maternal mortality ratio. This axis reflects a broad aspect of a country’s overall progress and development.
PC2 - Female Empowerment and Socio-Economic Development by Women. This component specifically highlights aspects related to female empowerment, participation, and their impact on socio-economic development. It might incorporate variables such as Female Education, Female Labor Force Participation.
Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points):
In the first PCA, the GNI variable has too high metrics and too big variances vs the other variables. This variable dominates the PCA and therefore the PCA cannot be analyzed properly. The magnitude is too different between the metrics to compare them together without scaling them. After scaling the dataset, I did the PCA again to compare them more equally.
Analyze the standardized PCA: The arrows aligned with PC1 and its associated variables provide insight into the relationships between the variables themselves, it is the same for the PC2. The direction of the arrow suggests that the pointed variables have a significant and positive influence on the principal components represented in the plot. The direction of the arrow indicates which way the variable is correlated with the principal component. The length of the arrow represents the strength of the variable’s contribution to that principal component. Longer arrows indicate a stronger influence on that component.
On the left side of the chart, countries appear more developed, characterized by higher values in key development indicators such as Gross National Income, Life Expectancy at Birth, Human Development Index, Life Expectancy in Education, and Female Population with Secondary Education. These variables has significant influence, suggesting that the variables impact positively each other. As these metrics increase, the level of development tends to be higher. We can find in this area many European/Western countries. Some of them also have the characteristics of higher labor rate among females and a closer rate when compared to the labor rate for males (e.g Iceland, Norway, Denmark). Conversely, countries like Qatar, Bahrain, UAE, and Saudi Arabia display a relatively good development level but possess lower female labor participation, impacting the overall women development and freedom.
On the opposite side of the chart, variables such as Adolescent Birth Rate, Maternal Mortality Ratio, Male Labor Force Participation Rate, HDI Rank, and GNI Rank (where higher ranks denote lower HDI or GNI) portray countries facing more challenging situations. Higher values in these variables correlate with less favorable conditions in a country. As these indicators rise, the situation tends to be less favorable. These countries, primarily from the Global South, such as Burkina Faso, Haiti, Niger, and Sierra Leone, face lower overall development. Women situations also vary based on their positioning along PC2, reflecting the diverse and inequal experiences of women in different countries.
The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions).
In addition, some personal details were asked (4 questions). Load the tea dataset and convert its character variables to factors: tea <- read.csv(“https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv”, stringsAsFactors = TRUE)
Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data.
tea_new <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# View and summary of the dataset
view(tea_new)
str(tea_new)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_new)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 +60 :38 +2/day :127
## middle :40 sportsman :179 15-24:92 1 to 2/week: 44
## non-worker :64 25-34:69 1/day : 95
## other worker:20 35-44:40 3 to 6/week: 34
## senior :35 45-59:61
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
# variables are factorial, except few ones. let's map the different categories thanks to the MCA
# But first let's keep only the categorical variables. An alternative would have been to change the non-catgeorical variables to categorical ones.
tea_new_2 <- tea_new %>% dplyr::select(-breakfast,-age)
On the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
library(FactoMineR)
# MCA
mca <- MCA(tea_new_2, graph = FALSE)
mca
## **Results of the Multiple Correspondence Analysis (MCA)**
## The analysis was performed on 300 individuals, described by 34 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. of the categories"
## 4 "$var$cos2" "cos2 for the categories"
## 5 "$var$contrib" "contributions of the categories"
## 6 "$var$v.test" "v-test for the categories"
## 7 "$var$eta2" "coord. of variables"
## 8 "$ind" "results for the individuals"
## 9 "$ind$coord" "coord. for the individuals"
## 10 "$ind$cos2" "cos2 for the individuals"
## 11 "$ind$contrib" "contributions of the individuals"
## 12 "$call" "intermediate results"
## 13 "$call$marge.col" "weights of columns"
## 14 "$call$marge.li" "weights of rows"
# summary of the MCA
summary(mca)
##
## Call:
## MCA(X = tea_new_2, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.092 0.084 0.072 0.060 0.057 0.055 0.050
## % of var. 5.913 5.392 4.629 3.881 3.672 3.502 3.188
## Cumulative % of var. 5.913 11.305 15.934 19.815 23.488 26.989 30.177
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.049 0.048 0.045 0.042 0.042 0.039 0.038
## % of var. 3.138 3.050 2.885 2.675 2.672 2.491 2.424
## Cumulative % of var. 33.315 36.365 39.251 41.926 44.598 47.089 49.513
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.036 0.036 0.034 0.032 0.032 0.031 0.030
## % of var. 2.320 2.305 2.173 2.077 2.049 2.016 1.944
## Cumulative % of var. 51.833 54.138 56.311 58.387 60.436 62.453 64.397
## Dim.22 Dim.23 Dim.24 Dim.25 Dim.26 Dim.27 Dim.28
## Variance 0.029 0.027 0.026 0.026 0.025 0.024 0.024
## % of var. 1.880 1.761 1.692 1.669 1.624 1.553 1.544
## Cumulative % of var. 66.276 68.038 69.730 71.399 73.023 74.577 76.120
## Dim.29 Dim.30 Dim.31 Dim.32 Dim.33 Dim.34 Dim.35
## Variance 0.023 0.022 0.022 0.021 0.020 0.019 0.019
## % of var. 1.456 1.426 1.404 1.339 1.264 1.246 1.221
## Cumulative % of var. 77.576 79.002 80.406 81.745 83.009 84.255 85.476
## Dim.36 Dim.37 Dim.38 Dim.39 Dim.40 Dim.41 Dim.42
## Variance 0.017 0.017 0.017 0.016 0.016 0.015 0.015
## % of var. 1.120 1.112 1.078 1.028 1.005 0.968 0.939
## Cumulative % of var. 86.596 87.708 88.786 89.814 90.819 91.786 92.725
## Dim.43 Dim.44 Dim.45 Dim.46 Dim.47 Dim.48 Dim.49
## Variance 0.013 0.013 0.012 0.012 0.011 0.010 0.010
## % of var. 0.866 0.837 0.802 0.739 0.697 0.674 0.666
## Cumulative % of var. 93.590 94.427 95.229 95.968 96.666 97.339 98.005
## Dim.50 Dim.51 Dim.52 Dim.53
## Variance 0.009 0.008 0.007 0.006
## % of var. 0.603 0.531 0.458 0.403
## Cumulative % of var. 98.608 99.139 99.597 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.629 1.432 0.202 | 0.148 0.086 0.011 | 0.109 0.055
## 2 | -0.427 0.660 0.139 | 0.288 0.330 0.064 | -0.110 0.056
## 3 | 0.097 0.034 0.006 | -0.156 0.096 0.015 | 0.125 0.073
## 4 | -0.560 1.133 0.227 | -0.279 0.308 0.056 | -0.026 0.003
## 5 | -0.173 0.108 0.028 | -0.149 0.088 0.020 | 0.027 0.003
## 6 | -0.681 1.675 0.272 | -0.293 0.340 0.050 | -0.008 0.000
## 7 | -0.223 0.181 0.037 | 0.015 0.001 0.000 | 0.183 0.155
## 8 | -0.031 0.003 0.001 | 0.111 0.048 0.009 | -0.097 0.043
## 9 | -0.063 0.014 0.003 | 0.266 0.281 0.048 | 0.388 0.697
## 10 | 0.177 0.114 0.021 | 0.369 0.539 0.090 | 0.315 0.459
## cos2
## 1 0.006 |
## 2 0.009 |
## 3 0.009 |
## 4 0.000 |
## 5 0.001 |
## 6 0.000 |
## 7 0.025 |
## 8 0.007 |
## 9 0.103 |
## 10 0.066 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## Not.tea time | -0.552 4.238 0.236 -8.396 | 0.001 0.000 0.000 0.019 |
## tea time | 0.428 3.285 0.236 8.396 | -0.001 0.000 0.000 -0.019 |
## evening | 0.296 0.961 0.046 3.704 | -0.404 1.961 0.085 -5.051 |
## Not.evening | -0.155 0.503 0.046 -3.704 | 0.211 1.025 0.085 5.051 |
## lunch | 0.590 1.630 0.060 4.231 | -0.406 0.844 0.028 -2.907 |
## Not.lunch | -0.101 0.280 0.060 -4.231 | 0.070 0.145 0.028 2.907 |
## dinner | -1.047 2.447 0.082 -4.965 | -0.080 0.016 0.000 -0.379 |
## Not.dinner | 0.079 0.184 0.082 4.965 | 0.006 0.001 0.000 0.379 |
## always | 0.342 1.281 0.061 4.276 | -0.255 0.784 0.034 -3.193 |
## Not.always | -0.179 0.670 0.061 -4.276 | 0.134 0.410 0.034 3.193 |
## Dim.3 ctr cos2 v.test
## Not.tea time 0.063 0.071 0.003 0.959 |
## tea time -0.049 0.055 0.003 -0.959 |
## evening 0.324 1.469 0.055 4.052 |
## Not.evening -0.169 0.768 0.055 -4.052 |
## lunch 0.254 0.386 0.011 1.822 |
## Not.lunch -0.044 0.066 0.011 -1.822 |
## dinner 0.744 1.581 0.042 3.531 |
## Not.dinner -0.056 0.119 0.042 -3.531 |
## always 0.095 0.126 0.005 1.186 |
## Not.always -0.050 0.066 0.005 -1.186 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## tea.time | 0.236 0.000 0.003 |
## evening | 0.046 0.085 0.055 |
## lunch | 0.060 0.028 0.011 |
## dinner | 0.082 0.000 0.042 |
## always | 0.061 0.034 0.005 |
## home | 0.013 0.002 0.027 |
## work | 0.071 0.020 0.027 |
## tearoom | 0.326 0.021 0.027 |
## friends | 0.201 0.058 0.024 |
## resto | 0.155 0.007 0.030 |
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic",col.var = rainbow(ncol(tea_new_2)))
# It allows us to visualize many different categories from many catgeorical variables, all at once to see the categories that fit together and suggest association (similarities in behavior). For instance, unpackages and tea shop seem close to each other and separated from the rest of the categories. It suggests that they share some commonality / similar patterns or behavior in term of users preferences. Besides, they tend to occur together within the dataset.
# Select smaller dataset to do the MCA analyze with less variables
# let's now choose only few variables that could have correlations together but also show distinct behavior between users.
tea_new_2 %>%
pivot_longer(cols = everything()) %>%
ggplot(aes(x = value)) +
facet_wrap(~ name, scales = "free") +
geom_bar() +
labs(title = "Distribution of Categorical Variables") +
theme(axis.text.x = element_text(angle = 25, hjust = 1, size = 10)) +
theme(strip.text = element_text(size = 10))
# smaller data set - I will look at 2 different ones.
tea_smaller <- tea_new %>% dplyr::select(dinner, effect.on.health, frequency, evening, home, healthy, how, lunch, pub, price,Tea, where, work, sex)
tea_smaller2 <- tea_new %>% dplyr::select(diuretic, feminine, healthy, exciting, iron.absorption, escape.exoticism, relaxing, sophisticated, sex, slimming)
# MCA models
mca2 <- MCA(tea_smaller, graph = FALSE)
mca3 <- MCA(tea_smaller2, graph = FALSE)
# summary of the models
summary(mca2)
##
## Call:
## MCA(X = tea_smaller, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.154 0.151 0.109 0.100 0.092 0.088 0.081
## % of var. 9.395 9.176 6.648 6.066 5.627 5.331 4.944
## Cumulative % of var. 9.395 18.572 25.220 31.286 36.914 42.245 47.189
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13 Dim.14
## Variance 0.081 0.075 0.071 0.068 0.064 0.063 0.061
## % of var. 4.911 4.595 4.306 4.129 3.887 3.820 3.690
## Cumulative % of var. 52.100 56.695 61.000 65.129 69.016 72.836 76.526
## Dim.15 Dim.16 Dim.17 Dim.18 Dim.19 Dim.20 Dim.21
## Variance 0.059 0.054 0.050 0.046 0.045 0.040 0.035
## % of var. 3.589 3.297 3.035 2.806 2.766 2.413 2.110
## Cumulative % of var. 80.114 83.411 86.446 89.252 92.018 94.431 96.542
## Dim.22 Dim.23
## Variance 0.033 0.024
## % of var. 2.020 1.438
## Cumulative % of var. 98.562 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.251 0.136 0.026 | 0.542 0.649 0.120 | -0.166
## 2 | -0.235 0.120 0.070 | 0.139 0.042 0.024 | -0.231
## 3 | -0.181 0.070 0.019 | -0.077 0.013 0.003 | -0.043
## 4 | -0.158 0.054 0.015 | 0.447 0.442 0.124 | -0.200
## 5 | -0.107 0.025 0.013 | 0.020 0.001 0.000 | 0.383
## 6 | -0.266 0.153 0.029 | 0.673 1.001 0.185 | -0.216
## 7 | -0.175 0.066 0.029 | 0.098 0.021 0.009 | 0.148
## 8 | -0.125 0.034 0.014 | 0.139 0.042 0.017 | 0.213
## 9 | 0.447 0.431 0.163 | -0.239 0.126 0.046 | 0.076
## 10 | 0.559 0.675 0.229 | -0.326 0.234 0.077 | -0.156
## ctr cos2
## 1 0.085 0.011 |
## 2 0.164 0.068 |
## 3 0.006 0.001 |
## 4 0.122 0.025 |
## 5 0.447 0.173 |
## 6 0.142 0.019 |
## 7 0.067 0.021 |
## 8 0.139 0.039 |
## 9 0.018 0.005 |
## 10 0.075 0.018 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## dinner | 0.590 1.129 0.026 2.801 | 1.055 3.690 0.084 5.004
## Not.dinner | -0.044 0.085 0.026 -2.801 | -0.079 0.278 0.084 -5.004
## effect on health | 0.315 1.012 0.028 2.896 | 0.415 1.793 0.049 3.808
## No.effect on health | -0.089 0.286 0.028 -2.896 | -0.117 0.506 0.049 -3.808
## +2/day | 0.138 0.373 0.014 2.044 | -0.578 6.698 0.245 -8.562
## 1 to 2/week | 0.096 0.062 0.002 0.687 | 0.889 5.494 0.136 6.374
## 1/day | -0.311 1.416 0.045 -3.659 | 0.467 3.277 0.101 5.502
## 3 to 6/week | 0.229 0.275 0.007 1.416 | -0.298 0.477 0.011 -1.842
## evening | 0.132 0.276 0.009 1.648 | -0.277 1.246 0.040 -3.460
## Not.evening | -0.069 0.144 0.009 -1.648 | 0.145 0.651 0.040 3.460
## Dim.3 ctr cos2 v.test
## dinner | -0.266 0.323 0.005 -1.260 |
## Not.dinner | 0.020 0.024 0.005 1.260 |
## effect on health | 0.787 8.908 0.175 7.226 |
## No.effect on health | -0.222 2.513 0.175 -7.226 |
## +2/day | -0.225 1.405 0.037 -3.338 |
## 1 to 2/week | 0.948 8.623 0.155 6.797 |
## 1/day | -0.451 4.212 0.094 -5.309 |
## 3 to 6/week | 0.875 5.669 0.098 5.407 |
## evening | 0.433 4.202 0.098 5.409 |
## Not.evening | -0.226 2.197 0.098 -5.409 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## dinner | 0.026 0.084 0.005 |
## effect.on.health | 0.028 0.049 0.175 |
## frequency | 0.046 0.337 0.304 |
## evening | 0.009 0.040 0.098 |
## home | 0.001 0.023 0.035 |
## healthy | 0.015 0.016 0.489 |
## how | 0.587 0.297 0.002 |
## lunch | 0.000 0.152 0.049 |
## pub | 0.000 0.167 0.008 |
## price | 0.617 0.224 0.148 |
summary(mca3)
##
## Call:
## MCA(X = tea_smaller2, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.172 0.133 0.119 0.113 0.101 0.094 0.079
## % of var. 17.193 13.350 11.935 11.267 10.062 9.371 7.921
## Cumulative % of var. 17.193 30.543 42.478 53.744 63.806 73.177 81.098
## Dim.8 Dim.9 Dim.10
## Variance 0.070 0.064 0.055
## % of var. 7.014 6.369 5.519
## Cumulative % of var. 88.112 94.481 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.705 0.964 0.496 | -0.036 0.003 0.001 | 0.267
## 2 | -0.061 0.007 0.004 | 0.510 0.649 0.266 | 0.119
## 3 | -0.160 0.050 0.034 | -0.400 0.400 0.213 | 0.317
## 4 | -0.333 0.214 0.157 | -0.289 0.208 0.118 | 0.218
## 5 | -0.512 0.509 0.251 | -0.050 0.006 0.002 | 0.263
## 6 | -0.666 0.860 0.494 | -0.410 0.420 0.187 | 0.415
## 7 | -0.666 0.860 0.494 | -0.410 0.420 0.187 | 0.415
## 8 | 0.165 0.053 0.041 | -0.600 0.899 0.542 | -0.272
## 9 | -0.636 0.783 0.462 | -0.272 0.185 0.085 | -0.150
## 10 | -0.431 0.361 0.224 | -0.253 0.159 0.077 | 0.606
## ctr cos2
## 1 0.198 0.071 |
## 2 0.039 0.014 |
## 3 0.281 0.133 |
## 4 0.133 0.067 |
## 5 0.193 0.066 |
## 6 0.481 0.192 |
## 7 0.481 0.192 |
## 8 0.207 0.112 |
## 9 0.063 0.026 |
## 10 1.025 0.441 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## diuretic | 0.409 5.643 0.231 8.311 | 0.241 2.533 0.081
## Not.diuretic | -0.565 7.792 0.231 -8.311 | -0.333 3.498 0.081
## feminine | 0.799 15.957 0.481 11.997 | -0.147 0.694 0.016
## Not.feminine | -0.603 12.038 0.481 -11.997 | 0.111 0.524 0.016
## healthy | 0.239 2.323 0.133 6.309 | -0.120 0.757 0.034
## Not.healthy | -0.557 5.421 0.133 -6.309 | 0.280 1.766 0.034
## exciting | 0.069 0.108 0.003 0.951 | 0.993 28.589 0.622
## No.exciting | -0.044 0.068 0.003 -0.951 | -0.626 18.023 0.622
## iron absorption | 0.767 3.536 0.068 4.503 | 0.130 0.130 0.002
## Not.iron absorption | -0.088 0.407 0.068 -4.503 | -0.015 0.015 0.002
## v.test Dim.3 ctr cos2 v.test
## diuretic 4.906 | 0.277 3.731 0.106 5.630 |
## Not.diuretic -4.906 | -0.383 5.153 0.106 -5.630 |
## feminine -2.205 | -0.373 5.022 0.105 -5.607 |
## Not.feminine 2.205 | 0.282 3.789 0.105 5.607 |
## healthy -3.173 | 0.369 7.977 0.317 9.741 |
## Not.healthy 3.173 | -0.861 18.614 0.317 -9.741 |
## exciting 13.640 | -0.133 0.575 0.011 -1.829 |
## No.exciting -13.640 | 0.084 0.363 0.011 1.829 |
## iron absorption 0.762 | -0.065 0.036 0.000 -0.379 |
## Not.iron absorption -0.762 | 0.007 0.004 0.000 0.379 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## diuretic | 0.231 0.081 0.106 |
## feminine | 0.481 0.016 0.105 |
## healthy | 0.133 0.034 0.317 |
## exciting | 0.003 0.622 0.011 |
## iron.absorption | 0.068 0.002 0.000 |
## escape.exoticism | 0.053 0.029 0.000 |
## relaxing | 0.006 0.438 0.062 |
## sophisticated | 0.173 0.002 0.106 |
## sex | 0.304 0.070 0.240 |
## slimming | 0.267 0.041 0.245 |
# visualize MCA
plot(mca2, invisible=c("ind"), graph.type = "classic")
plot(mca3, invisible=c("ind"), graph.type = "classic")
MCA 2: The categories “tea shop,” “upscale price,” and “unpackaged” appear closely related, indicating they collectively represent a particular type of behavior or consumer preference. Similarly, “tea bag” and “chain store” seem to indicate anothe pattern within the dataset.
MCA 3: Categories associated with tea benefits like “iron absorption,” “healthy,” “sophisticated,” and “diuretic” form a cluster, suggesting they share similar characteristics or appeal to a specific segment of consumers. Oppositely, tea without special benefits stands apart, indicating a different category or behavior. Moreover, the two categories “exciting” and “not relaxing” but also the two categories “feminine tea” and “female” show similarities, aligning with their expected conceptual correspondence in this context.
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.